Zero-shot learning (ZSL) models rely on learning a joint embedding space where both the textual/semantic description of object classes and the visual representation of object images can be projected for nearest neighbour search. Despite the success of deep neural networks that learn an end-to-end model between text and images in other vision problems such as image captioning, very few deep ZSL models exist, and they show little advantage over ZSL models that utilise deep feature representations but do not learn an end-to-end embedding. In this paper we argue that the key to making deep ZSL models succeed is to choose the right embedding space. Instead of embedding into a semantic space or an intermediate space, we propose to use the visual space as the embedding space. This is because in this space, the subsequent nearest neighbour search suffers much less from the hubness problem and thus becomes more effective. This model design also provides a natural mechanism for multiple semantic modalities (e.g., attributes and sentence descriptions) to be fused and optimised jointly in an end-to-end manner. Extensive experiments on four benchmarks show that our model significantly outperforms the existing models.
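To make the core design choice concrete, the following is a minimal conceptual sketch (not the authors' implementation) of zero-shot classification where class semantic vectors are projected into the visual feature space and test images are labelled by nearest neighbour search there. The mapping `project_semantic`, the linear weight `W`, and all dimensions are illustrative assumptions; in the paper the semantic-to-visual mapping is a learned deep network.

```python
import numpy as np

def project_semantic(class_semantics, W):
    """Map semantic class vectors (e.g. attribute vectors) into the visual feature space.

    A single linear layer W stands in for the learned deep embedding,
    purely for illustration.
    """
    return class_semantics @ W  # shape: (num_classes, visual_dim)

def zero_shot_classify(image_features, class_semantics, W):
    """Assign each image to the class whose projected prototype is nearest in visual space."""
    prototypes = project_semantic(class_semantics, W)
    # Euclidean distance between every image feature and every class prototype
    dists = np.linalg.norm(image_features[:, None, :] - prototypes[None, :, :], axis=-1)
    return dists.argmin(axis=1)  # predicted class index per image

# Toy usage: 5 unseen classes described by 85-d attributes, 2048-d visual features
rng = np.random.default_rng(0)
semantics = rng.normal(size=(5, 85))
W = rng.normal(size=(85, 2048)) * 0.01   # stand-in for the learned embedding weights
images = rng.normal(size=(10, 2048))
print(zero_shot_classify(images, semantics, W))
```

Searching among class prototypes embedded in the visual space, rather than projecting image features into the semantic space, is what the abstract argues mitigates the hubness problem.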